In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
import dateutil.parser

First, I made a mistake naming the data set! It's 2015 data, not 2014 data. But yes, still use 311-2014.csv. You can rename it.

Importing and preparing your data

Import your data, but only the first 200,000 rows. You'll also want to change the index to be a datetime based on the Created Date column - you'll want to check if it's already a datetime, and parse it if not.


In [2]:
import datetime
import datetime as dt

dt.datetime.strptime('07/06/2015 10:58:27 AM', '%m/%d/%Y %I:%M:%S %p')
#datetime.datetime(2015, 7, 6, 0, 0)
parser = lambda date: pd.datetime.strptime(date, '%m/%d/%Y %H:%M:%S')

In [3]:
df = pd.read_csv("311-2015.csv", low_memory=False, parse_dates=[1], dtype=str , nrows=200000)

In [4]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 53 columns):
Unique Key                        200000 non-null object
Created Date                      200000 non-null datetime64[ns]
Closed Date                       188913 non-null object
Agency                            200000 non-null object
Agency Name                       200000 non-null object
Complaint Type                    200000 non-null object
Descriptor                        198197 non-null object
Location Type                     179328 non-null object
Incident Zip                      181049 non-null object
Incident Address                  152173 non-null object
Street Name                       152152 non-null object
Cross Street 1                    108035 non-null object
Cross Street 2                    107583 non-null object
Intersection Street 1             24790 non-null object
Intersection Street 2             24530 non-null object
Address Type                      177091 non-null object
City                              181095 non-null object
Landmark                          127 non-null object
Facility Type                     80031 non-null object
Status                            199998 non-null object
Due Date                          152018 non-null object
Resolution Description            198936 non-null object
Resolution Action Updated Date    188529 non-null object
Community Board                   200000 non-null object
Borough                           200000 non-null object
X Coordinate (State Plane)        175825 non-null object
Y Coordinate (State Plane)        175825 non-null object
Park Facility Name                200000 non-null object
Park Borough                      200000 non-null object
School Name                       200000 non-null object
School Number                     199907 non-null object
School Region                     197128 non-null object
School Code                       197128 non-null object
School Phone Number               200000 non-null object
School Address                    200000 non-null object
School City                       200000 non-null object
School State                      200000 non-null object
School Zip                        199999 non-null object
School Not Found                  151897 non-null object
School or Citywide Complaint      0 non-null object
Vehicle Type                      34 non-null object
Taxi Company Borough              434 non-null object
Taxi Pick Up Location             3680 non-null object
Bridge Highway Name               1960 non-null object
Bridge Highway Direction          1959 non-null object
Road Ramp                         1946 non-null object
Bridge Highway Segment            2134 non-null object
Garage Lot Name                   143 non-null object
Ferry Direction                   86 non-null object
Ferry Terminal Name               215 non-null object
Latitude                          175825 non-null object
Longitude                         175825 non-null object
Location                          175825 non-null object
dtypes: datetime64[ns](1), object(52)
memory usage: 80.9+ MB

In [5]:
df.index = df['Created Date']

In [6]:
del df['Created Date']

In [7]:
df.head(2)


Out[7]:
Unique Key Closed Date Agency Agency Name Complaint Type Descriptor Location Type Incident Zip Incident Address Street Name ... Bridge Highway Name Bridge Highway Direction Road Ramp Bridge Highway Segment Garage Lot Name Ferry Direction Ferry Terminal Name Latitude Longitude Location
Created Date
2015-07-06 10:58:27 31015465 07/22/2015 01:07:20 AM DCA Department of Consumer Affairs Consumer Complaint Demand for Cash NaN 11360 27-16 203 STREET 203 STREET ... NaN NaN NaN NaN NaN NaN NaN 40.773539552542 -73.78823697228408 (40.773539552542, -73.78823697228408)
2015-07-03 13:26:29 30997660 07/03/2015 02:08:20 PM NYPD New York City Police Department Vending In Prohibited Area Residential Building/House 10019 200 CENTRAL PARK SOUTH CENTRAL PARK SOUTH ... NaN NaN NaN NaN NaN NaN NaN 40.76702142171206 -73.97944780718524 (40.76702142171206, -73.97944780718524)

2 rows × 52 columns

What was the most popular type of complaint, and how many times was it filed?


In [8]:
df['Complaint Type'].value_counts().head(1)


Out[8]:
Blocked Driveway    21779
Name: Complaint Type, dtype: int64

Make a horizontal bar graph of the top 5 most frequent complaint types.


In [9]:
ax = df['Complaint Type'].value_counts().sort_values(ascending=True).tail(5).plot(kind='barh', figsize=(6,4), fontsize=9)
ax.set_title("Top 5 Most Frequent 311-Complaints in 2015")
ax.set_xlabel("Complaint Count")
ax.set_ylabel("Complaint Type")
plt.savefig("5 Most Frequent 311 Complaints in 2015.svg", bbox_inches='tight')
#http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html


Which borough has the most complaints per capita? Since it's only 5 boroughs, you can do the math manually.


In [10]:
df_summed_complaints = df['Borough'].value_counts()
summed_complaints = pd.DataFrame(df_summed_complaints)
summed_complaints.reset_index(inplace=True)
summed_complaints.columns = ['Borough', 'Complaint Count']

In [11]:
Borough_head_count = pd.read_csv("NYC_Boroughs.csv")

In [12]:
summed_complaints_merged = summed_complaints.merge(Borough_head_count, left_on='Borough', right_on='borough name')

In [13]:
del summed_complaints_merged['borough name']

In [14]:
summed_complaints_merged


Out[14]:
Borough Complaint Count abbreviation Total
0 BROOKLYN 57129 K 983769
1 QUEENS 46824 Q 1134064
2 MANHATTAN 42050 NY 748583
3 BRONX 29610 BX 456847
4 Unspecified 17000 NaN 0
5 STATEN ISLAND 7387 R 304250

In [15]:
summed_complaints_merged['Per Capita'] = summed_complaints_merged['Total'] / summed_complaints_merged['Complaint Count']

In [16]:
summed_complaints_merged['Per Capita'].sort_values(ascending=False)
summed_complaints_merged[['Borough', 'Per Capita']]


Out[16]:
Borough Per Capita
0 BROOKLYN 17.220133
1 QUEENS 24.219716
2 MANHATTAN 17.802212
3 BRONX 15.428808
4 Unspecified 0.000000
5 STATEN ISLAND 41.187221

In [17]:
Sorted_complaints = summed_complaints_merged.sort_values(by='Per Capita', ascending=False)

In [18]:
Sorted_complaints


Out[18]:
Borough Complaint Count abbreviation Total Per Capita
5 STATEN ISLAND 7387 R 304250 41.187221
1 QUEENS 46824 Q 1134064 24.219716
2 MANHATTAN 42050 NY 748583 17.802212
0 BROOKLYN 57129 K 983769 17.220133
3 BRONX 29610 BX 456847 15.428808
4 Unspecified 17000 NaN 0 0.000000

According to your selection of data, how many cases were filed in March? How about May?


In [19]:
df['2015-03']['Unique Key'].count()


Out[19]:
15025

In [20]:
df['2015-05']['Unique Key'].count()


Out[20]:
49715

I'd like to see all of the 311 complaints called in on April 1st.

Surprise! We couldn't do this in class, but it was just a limitation of our data set


In [21]:
df['2015-04-01']['Unique Key'].count()


Out[21]:
573

What was the most popular type of complaint on April 1st?


In [22]:
df['2015-04-01']['Complaint Type'].value_counts().head(1)


Out[22]:
Illegal Parking    67
Name: Complaint Type, dtype: int64

What were the most popular three types of complaint on April 1st


In [23]:
df['2015-04-01']['Complaint Type'].value_counts().head(3)


Out[23]:
Illegal Parking     67
Street Condition    64
Blocked Driveway    58
Name: Complaint Type, dtype: int64

What month has the most reports filed? How many? Graph it.


In [24]:
#PANDAS resample: http://stackoverflow.com/questions/17001389/pandas-resample-documentation/17001474#17001474

In [25]:
ax = df.resample('M')['Unique Key'].count().plot(kind='barh')
#ax.set_yticks(['2015-01-31 00:00:00, 2015-02-28 00:00:00, 2015-03-30 00:00:00, 2015-04-30 00:00:00, 2015-05-31 00:00:00, 2015-06-30 00:00:00, 2015-07-31 00:00:00, 2015-08-31 00:00:00, 2015-09-30 00:00:00, 2015-10-31 00:00:00, 2015-11-30 00:00:00, 2015-12-31 00:00:00'])
ax.set_yticklabels(['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
ax.set_title('Number of Complaints per Month')
ax.set_ylabel('Months of the Year')
ax.set_xlabel('Complaint Counts')


Out[25]:
<matplotlib.text.Text at 0x107ce78d0>

What week of the year has the most reports filed? How many? Graph the weekly complaints.


In [26]:
df_week_count = df.resample('W')['Unique Key'].count()

In [27]:
df_week_count = pd.DataFrame(df_week_count)

In [28]:
df_week_count.sort_values(by='Unique Key', ascending=False).head(5)


Out[28]:
Unique Key
Created Date
2015-05-10 13559
2015-05-17 11683
2015-05-24 10351
2015-05-03 10184
2015-05-31 9387

Noise complaints are a big deal. Use .str.contains to select noise complaints, and make an chart of when they show up annually. Then make a chart about when they show up every day (cyclic).


In [29]:
df['Noise'] = df['Complaint Type'].str.contains('Noise')
df_noise = df[df['Noise'] == True]

In [30]:
ax = df_noise.resample('D')['Unique Key'].count().plot(kind='bar', figsize=(15,4))
ax.set_xticklabels('')
ax.set_title('Number of Noise Complaints per Day in 2015')
ax.set_ylabel('Noise Complaint Count')
ax.set_xlabel('From January - December 2015 ')


Out[30]:
<matplotlib.text.Text at 0x1086cfe80>

In [31]:
df_noise.groupby(by=df_noise.index.hour)['Unique Key'].count().plot(kind='bar', figsize=(12,6))


Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x107dc2c50>

Which were the top five days of the year for filing complaints? How many on each of those days? Graph it.


In [32]:
df_day_count = df.resample('D')['Unique Key'].count()
df_day_count = pd.DataFrame(df_day_count)
ax = df_day_count.sort_values(by='Unique Key', ascending=True).tail(5).plot(kind='barh', legend=False)
ax.set_yticklabels(['10. Oktober', '6. August', '3. August', '25. September', '19. Oktober'])
ax.set_title('Top five complaint days in 2015')
ax.set_ylabel('')
ax.set_xlabel('Complaint Counts')
plt.savefig("Top 5 Complaint Days", bbox_inches='tight')


What hour of the day are the most complaints? Graph a day of complaints.


In [33]:
ax = df.groupby(by=df.index.hour)['Unique Key'].count().plot(kind='bar', figsize=(12,6))
ax.set_title('Number of Complaints per hour in 2015')


Out[33]:
<matplotlib.text.Text at 0x10965bf98>

One of the hours has an odd number of complaints. What are the most common complaints at that hour, and what are the most common complaints the hour before and after?


In [34]:
df.groupby(by=df.index.hour)['Complaint Type'].value_counts().head(2)


Out[34]:
   Complaint Type
0  HEAT/HOT WATER    4534
   Rodent            2112
Name: Complaint Type, dtype: int64

In [35]:
df_complaint_type_count_per_hour = df.groupby(by=df.index.hour)['Complaint Type'].value_counts()

In [36]:
Top_complaints_by_hour = pd.DataFrame(df_complaint_type_count_per_hour)

In [37]:
Top_complaints_by_hour['Complaint Type'][0].head(1)


Out[37]:
Complaint Type
HEAT/HOT WATER    4534
Name: Complaint Type, dtype: int64

In [38]:
Top_complaints_by_hour['Complaint Type'][1].head(1)


Out[38]:
Complaint Type
Noise - Commercial    1025
Name: Complaint Type, dtype: int64

In [39]:
Top_complaints_by_hour['Complaint Type'][23].head(1)


Out[39]:
Complaint Type
Noise - Street/Sidewalk    1599
Name: Complaint Type, dtype: int64

In [40]:
#More Reading: http://pandas.pydata.org/pandas-docs/version/0.13.1/timeseries.html

So odd. What's the per-minute breakdown of complaints between 12am and 1am? You don't need to include 1am.


In [41]:
df.groupby(by=df.index.hour == 0)['Unique Key'].count()


Out[41]:
False    177573
True      22427
Name: Unique Key, dtype: int64

Looks like midnight is a little bit of an outlier. Why might that be? Take the 5 most common agencies and graph the times they file reports at (all day, not just midnight).


In [42]:
df['Agency'].value_counts().head(5)


Out[42]:
NYPD     80000
HPD      39388
DOT      22308
DPR      15505
DOHMH     8250
Name: Agency, dtype: int64

In [43]:
df['Agency Name'].value_counts().head(5)


Out[43]:
New York City Police Department                       80000
Department of Housing Preservation and Development    39363
Department of Transportation                          22308
Department of Parks and Recreation                    15128
Department of Health and Mental Hygiene                8216
Name: Agency Name, dtype: int64

In [44]:
df_NYPD = df[df['Agency Name'] == 'New York City Police Department']
df_NYPD.groupby(by=df_NYPD.index.hour)['Unique Key'].count().plot(kind='bar')


Out[44]:
<matplotlib.axes._subplots.AxesSubplot at 0x1099357f0>

In [45]:
df_HPD = df[df['Agency Name'] == 'Department of Housing Preservation and Development']
df_HPD.groupby(by=df_HPD.index.hour)['Unique Key'].count().plot(kind='bar')


Out[45]:
<matplotlib.axes._subplots.AxesSubplot at 0x10a51aa90>

In [46]:
df_DOT = df[df['Agency Name'] == 'Department of Transportation']
df_DOT.groupby(by=df_DOT.index.hour)['Unique Key'].count().plot(kind='bar')


Out[46]:
<matplotlib.axes._subplots.AxesSubplot at 0x1096051d0>

In [47]:
df_DPR = df[df['Agency Name'] == 'Department of Parks and Recreation']
df_DPR.groupby(by=df_DPR.index.hour)['Unique Key'].count().plot(kind='bar')


Out[47]:
<matplotlib.axes._subplots.AxesSubplot at 0x10a745048>

In [48]:
df_DOHMH = df[df['Agency Name'] == 'Department of Health and Mental Hygiene']
df_DOHMH.groupby(by=df_DOHMH.index.hour)['Unique Key'].count().plot(kind='bar')


Out[48]:
<matplotlib.axes._subplots.AxesSubplot at 0x10b8af9e8>

df_NYPD = df[df['Agency Name'] == 'New York City Police Department'] df_NYPD.groupby(by=df_NYPD.index.hour)['Unique Key'].count().plot(kind='bar')

Dont understand why I can't change the labels here:


In [49]:
ax = df[df['Agency Name'] == 'New York City Police Department'].groupby(by=df_NYPD.index.hour)['Unique Key'].count().plot(kind='bar', stacked=True, figsize=(15,6))
df[df['Agency Name'] == 'Department of Housing Preservation and Development'].groupby(by=df_HPD.index.hour)['Unique Key'].count().plot(kind='bar', ax=ax, stacked=True, color='lightblue')
df[df['Agency Name'] == 'Department of Transportation'].groupby(by=df_DOT.index.hour)['Unique Key'].count().plot(kind='bar', ax=ax, stacked=True, color='purple')
df[df['Agency Name'] == 'Department of Health and Mental Hygiene'].groupby(by=df_DOHMH.index.hour)['Unique Key'].count().plot(kind='bar', ax=ax, stacked=True, color='grey')
df[df['Agency Name'] == 'Department of Parks and Recreation'].groupby(by=df_DPR.index.hour)['Unique Key'].count().plot(kind='bar', ax=ax, stacked=True, color='green')
ax.set_title('Time of Day Agencies file complaints')
ax.set_ylabel('Complaint Count')
ax.set_xlabel("""
              Red: New York City Police Department
              Blue: Department of Housing Preservation and Development
              Purple: Department of Transportation
              Grey: Department of Health and Mental Hygiene
              Green: Department of Parks and Recreation""")
plt.savefig("Time of Day Agencies File Complaints.svg", bbox_inches='tight')


Graph those same agencies on an annual basis - make it weekly. When do people like to complain? When does the NYPD have an odd number of complaints?


In [50]:
ax = df[df['Agency Name'] == 'New York City Police Department'].resample('W')['Agency'].count().plot(figsize=(15,4), linewidth=3)
df[df['Agency Name'] == 'Department of Housing Preservation and Development'].resample('W')['Agency'].count().plot(color='lightblue', linewidth=2)
df[df['Agency Name'] == 'Department of Transportation'].resample('W')['Agency'].count().plot(color='purple', linewidth=2)
df[df['Agency Name'] == 'Department of Health and Mental Hygiene'].resample('W')['Agency'].count().plot(color='grey', linewidth=2)
df[df['Agency Name'] == 'Department of Parks and Recreation'].resample('W')['Agency'].count().plot(color='green', linewidth=2)
ax.set_xticklabels(['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
ax.set_title('When the Angencies Filed the Reports 2015')
ax.set_xlabel("""
              Red: New York City Police Department
              Blue: Department of Housing Preservation and Development
              Purple: Department of Transportation
              Grey: Department of Health and Mental Hygiene
              Green: Department of Parks and Recreation""")
plt.savefig("When the Angencies Filed the Reports 2015.svg", bbox_inches='tight')



In [51]:
ax = df_NYPD.resample('W')['Agency'].count().plot(kind='bar', figsize=(15,4))
ax.set_xticklabels([''])
ax.set_xlabel('January to December 2015')


Out[51]:
<matplotlib.text.Text at 0x10991d2b0>

Maybe the NYPD deals with different issues at different times? Check the most popular complaints in July and August vs the month of May. Also check the most common complaints for the Housing Preservation Bureau (HPD) in winter vs. summer.


In [52]:
df['2015-08']['Complaint Type'].value_counts().head(3)


Out[52]:
Noise - Street/Sidewalk    1467
Illegal Parking            1456
Blocked Driveway           1369
Name: Complaint Type, dtype: int64

In [53]:
df['2015-07']['Complaint Type'].value_counts().head(3)


Out[53]:
Illegal Parking            1988
Blocked Driveway           1889
Noise - Street/Sidewalk    1698
Name: Complaint Type, dtype: int64

In [54]:
df['2015-05']['Complaint Type'].value_counts().head(3)


Out[54]:
Blocked Driveway    4114
Illegal Parking     3975
HEAT/HOT WATER      3583
Name: Complaint Type, dtype: int64

In [ ]: